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Abstract 

A hybrid scheme that utihzes MPI for distributed memory parallehsm and OpenMP 
for shared memory parallehsm is presented. The work is motivated by the desire 
to achieve exceptionally high Reynolds numbers in pseudospectral computations of 
fluid turbulence on emerging petascale, high core-count, massively parallel process- 
ing systems. The hybrid implementation derives from and augments a well-tested 
scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new 
picture for the domain decomposition of the pseudospectral grids, which is helpful in 
understanding, among other things, the 3D transpose of the global data that is nec- 
essary for the parallel fast Fourier transforms that are the central component of the 
numerical discretizations. Details of the hybrid implementation are provided, and 
performance tests illustrate the utility of the method. It is shown that the hybrid 
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scheme achieves near ideal scalabihty up to ~ 20000 compute cores with a maxi- 
mum mean efficiency of 83%. Data are presented that demonstrate how to choose 
the optimal number of MPI processes and OpenMP threads in order to optimize 
code performance on two different platforms. 

Key words: computational fluids, numerical simulation, MPI, OpenMP, parallel 
scalability 



1 Introduction 

Fluid turbulence arises from interactions at all spatial and temporal scales, 
and is therefore the quintessential petascale application. The Reynolds num- 
ber Rv, which measures the strength of the nonlinearity in turbulent fluid 
systems, determines the number of degrees of freedom (d.o.f.) required to 
resolve all spatial scales, which increases as Rv^/^ (in the Kolmo gorov frame- 
work [TT][TO] ). For geophysical flows, Rv is often greater than 10^, suggesting 
the need to evolve the geo-fiuid equations with greater than 10^^ grid points, 
if completely accurate computations of turbulent geophysical flows are to be 
realized without resorting to modeling of unresolved scales. This approach to 
computing fluid flows in which all spatial and temporal scales are resolved 
is called direct numerical simulation (DNS). If the goal is to simulate geo- 
physical flows accurately, such computations must be carried out at exascale 
resolutions, which are not currently feasible. But petascale resolutions are just 
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now becoming available, that can accommodate resolutions of 10^^ grid points, 
corresponding to Rv ~ 10^, which still allows for sufficient scale separation to 
study physically relevant complex turbulent flows. 

Pseudospectral methods provide a very useful tool to study the problem 
because of their computational efficiency and high order numerical conver- 
gence. Attention is often focused on a 27r-periodic box domain in order to 
study scale interaction as it allows the use of fast spectral transforms that 
have a computational complexity of ~ A^log(A^) instead of ~ A^^, where TV 
is the linear resolution. For studies of homogeneous and isotropic turbulence, 
this choice is entirely consistent because the domain preserves the underly- 
ing translational and rotational invariance of the physics. But the approach is 
useful as well for studies of anisotropic or inhomogeneous turbulence, which 
broadens its usefulness. On the periodic domain, the Fourier basis is optimal, 
and the pseudospectral discretization [THTIIH] is pre-eminent due to the effec- 
tiveness of the fast Fourier transform (FFT) in converting from configuration 
to spectral space, and back again. The pseudospectral method [12] has thus 
been used extensively in studies of computational fluid dynamics (CFD) in- 
cluding turbulence, with references too numerous to cite. This method has the 
extra advantage of accurately capturing the interaction of multiple scales with 
little or no numerical dissipation or dispersion. This is clearly an important 
property for the numerics if we wish to quantify small scale dissipative effects 
that arise in the context of nonlinear turbulent interactions. 

Pseudospectral methods, however, require global spectral transforms, and, 
therefore, are hard to implement in distributed memory environments. This 
has been labeled a crucial limitation of the method until domain decomposition 
techniques arose that allowed computation of serial FFTs in different direc- 
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tions in space (local in memory) after performing transpositions. One of these 
methods is the ID (slab) domain decomposition (see e.g., [2]), that enables 
multidimensional FFTs to be parallelized effectively using the Message Pass- 
ing Interface (MPl). However, these methods are often limited in the number 
of processors that can be used, and generalizations to larger processor counts 
using solely MPI are often expensive or hard to tune as transpositions require 
all-to-all communications. Also, multi-dimensional transforms of some non- 
Fourier basis, such as spherical harmonics, cannot be parallelized using this 
technique. In the present work, a hybrid (MPI-OpenMP) scheme is described 
that builds upon the existing domain decomposition scheme that has been 
shown to be effective for parallel scaling using MPI alone. We leverage this ex- 
isting domain decomposition method in constructing a hybrid MPI-OpenMP 
model using loop-level OpenMP directives and multi-threaded FFTs. The im- 
plementation is intended to address several concerns: It addresses the multi — 
level architectures of emerging platforms; and it is also designed to be portable 
to a variety of systems, with the expectation that it will provide scalability 
and performance without detailed knowledge of network topology or cache 
structure. 

The idea of such loop-level-or implicit-parallelization in concert with MPI 
is not new. To date, these have generally been attempted on small core count 
systems, and the pure MPI scheme is found to outperform the hybrid schemes. 
In the context of CFD applications, it was found that on core counts up 
to 256 processors the overall elapsed time (for a finite element solver) was 
better for the pure MPI scheme than for the hybrid, even though the hybrid 
approach showed improved communication times in some cases [16] . A hybrid 
approach was taken in an implementation of a parallel 3D FFT algorithm 
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[H] that succeeded in reducing the number of cache misses in the algorithm 
on an SMP system. But this approach was again tested only on a small core 
count platform, and considered the FFT algorithm alone, without the full fluid 
solver. To the best of our knowledge, the scheme described herein is the first 
published implementation of a hybrid model in a pseudospectral CFD context 
that has been attempted on high core count systems, and found to scale well. 

In the following sections we present a new hybrid implementation. We begin 
first with a description (Section[2]) of the numerical method and the underlying 
domain decomposition scheme. In SectionOthe hybrid model is presented, and 
a new domain decomposition picture is offered for viewing the distribution of 
work on multicore nodes. We also discuss in this section the implementation of 
the loop-level parallelization. Benchmarks are provided in Section HI where we 
also consider the overhead and performance of the OpenMP parallelization, 
and the scalability of the full hybrid formulation. Finally, in Section |5l we offer 
some concluding remarks on lessons learned and our expectations for future 
hybrid performance on petascale systems. 

2 The pseudo-spectral method and the underlying domain decom- 



AU of the work in this paper will be based on simulations of the Navier- 
Stokes equations: 



position 



dtu + u • Vu 



Vp + z/V^u, 



(1) 



V-u = 0, 



(2) 
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where u is the velocity, the kinematic viscosity is i^, and the pressure, p, can be 
viewed as a Lagrange variable used to satisfy the incompressibility constraint, 
(Eq. These equations are solved using a pseudo-spectral method [T|7|5|T3] . 
in which each component of u is represented as a truncated (Galerkin) ex- 
pansion in terms of the Fourier basis, and the nonlinear term is computed in 
physical space and then transformed using the fast Fourier transform (FFT), 
to spectral space. The nonlinear term is computed in such a way that the 
velocity is projected onto a divergence- free space, in order to satisfy (Eq. |2]). 
Details of this projection and of the dealiasing required by the action of the 
nonlinear term are not central to the discussion and can be found elsewhere 
p^flS] , as can additional details of the discretization and parallelization of the 
scheme using solely MPI [B]. 

The key piece of any pseudospectral method, particularly for parallel com- 
puting, is the multidimensional Fourier transform algorithm. An efficient par- 
allel implementation of this algorithm is essential for attaining high Reynolds 
numbers in turbulent hydrodynamics simulations, which is of chief concern 
here. We focus on a 3D Fourier transform of a scalar (or vector component) 
field of size A^^, with nodes in each coordinate direction of the 27r-periodic 
domain. The distribution of real space points can be viewed as a cubic array 
of A^'^ real numbers. In the underlying domain decomposition each processor 
receives a "slab" of size N x N x M node points, where M = N/Np, and Np 
is the number of processors. This is referred to as a ID domain decomposition 
because the distribution to processors occurs in one direction only; this de- 
composition is visualized in Fig. [H Fourier transforms are performed locally 
in the direction of the arrows on the slab owned by a processor. The partially 
transformed (complex) data resides in a cube of size {N/2 + 1) x N x M. The 
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Fig. 1. Underlying ID (slab) domain decomposition for pseudo-spectral method 
(left). Each processor works on a slab of size N x N x AI, where M = N/P. The 
FFT is done by first doing the FFTs locally in each slab, in the directions specified 
by the arrows, yielding partially transformed data of size N/2 + 1 x N x M. Then, 
an all-to-all communication is done to transpose the data globally (right), so that 
the remaining ID FFT can be done in the direction specified by the arrow. The 
data for this step is stored in a cube of size N/2 + 1 x N x N, and each processor 
now computes the FFT locally in a slab of size P = {N/2 + 1)/P. [Figure adapted 
from [6].] 



reduction in the size of the array results from the fact that a Fourier transform 
of real data u{x) satisfies u{k) = —u*{k) (where the asterisk denotes complex 
conjugate), and therefore only half the numbers need to be stored. To compute 
the (complex) transform in the remaining direction, an all-to-all communica- 
tion is carried out in order to transform the global data cube, and decompose 
it into sUces of size P x N x N, where P = {N/2 + 1)/Np. Non-blocking 
MPI communication is used for the all-to-all exchange. This communication 
allows the transform to be carried out in the remaining direction (seen on the 
right in Fig. [1]) locally on each processor. Besides using non-blocking calls, it 
is important to make the communication in an ordered way that ensures com- 
munication balance. In [2], a list of all possible pairs of MPI tasks is created 
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to this end. Such a hst may create problems for large processor counts, and as 
a result here we implement the scheme shown in Fig. [2l Local FFTs are then 
computed using the open source FFTW package 

The ID domain decomposition scheme scales efficiently ( [2|6] ). and, when 
properly implemented, minimizes the number of all-to-all communications 
that must be done to complete the transpose. However, it also limits the 
number of processors to the maximum number of MPI processes that can be 
used, which is the linear resolution of the run, A^. In practice, departures from 
linear scaling are often observed before reaching MPI tasks, as the ratio 
of computing to communication time decreases. We address these issues in 
Section [31 

3 Implementation of the hybrid scheme 

The growing tendency for petascale platforms is toward a hierarchical shared- 
memory node structure with each node having multiple sockets, each with in- 
creasing numbers of compute cores with shared or separate caches, and which 
may be encapsulated within a non-uniform memory access (NUMA) domain 
within the node. This hierarchical design seems especially suited to a multi- 
level domain decomposition scheme that can be optimized for the hierarchi- 
cal hardware [9j. In order to address these emerging system designs, and to 
rectify the limitation in the underlying slab-only pseudo-spectral domain de- 
composition strategy of Section [21 which prevents scaling to processor counts 
beyond the number of MPI processes (linear resolution of the problem), we 
use OpenMP to improve the compute time of each MPI task. In this scheme, 
the MPI processes provide a coarse-grain parallelization using the slab do- 
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MPI task 
MPI task 1 
MPI task 2 
MPI task 3 
MPI task 4 

Fig. 2. Communication pattern for the all-to-all MPI communication to perform 
the transposition in the parallel FFT. Loops are executed in which point-to-point 
MPI communication (non-blocking send and receive) are performed with increasing 
stride between jobs, until all communications are performed. In the hybrid case, 
each MPI task can spawn several threads, and the communication is handled by the 
main thread. 

main decomposition described above, but OpenMP loop-level constructs and 
multi-threaded FFTs are applied within each MPI job to provide an inner 
level of parallelization. 

Figure [3] illustrates the two-level parallelization scheme. Each MPI task is 
parallelized by distributing work among a number of threads (TO . . . T3 in the 
figure), in possibly two different ways. This work distribution is provided by 
constructing parallel regions at the loop level using OpenMP directives. From 
the point of view of the outer level of parallelization, the multidimensional 
FFT discussed in Section [2] does not change. To show specific inner-level 
parallelization and to present the origin of the two different ways to look at the 
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(c) 

Fig. 3. Schematic of the new two-level domain decomposition strategy, (a) The ID 
domain decomposition now acts as a coarse-grain MPI-based domain decomposition 
step, (b) and (c) A single slab (owned by a single MPI task) is further parallelized in 
one of two ways by loop-level OpenMP directives that distribute different "chunks" 
of the slab to different threads (here, labeled TO . . . T3) to be worked on, speeding 
up the MPI task. Multi-threaded FFTs are also used in each slab. 

decomposition, we provide here a code fragment showing the use of OpenMP 
directives in carrying out the transpose within a slab, crucial for computing 
the FFT. We focus on this particular algorithm because of its importance for 
the performance of the parallel FFT and also also because it provides a good 
opportunity to highlight an important feature of the code: 

! Multi-threaded FFTs are computed 

! All-to-all MPI communication handled by the master thread 
ITramsposition is now done locally: 

!$omp parallel do if ((iend-ibeg)/csize.ge.nthrd) private (j j ,kk,i, j ,k) 
DO ii = ibeg, lend, csize 
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!$omp parallel do if ( (iend-ibeg)/csize . It .nthrd) private (kk,i,j,k) 
DO jj = l,N,csize 
DO kk = l,N,csize 

DO i = ii ,min(iend, ii+csize-1) 
DO j = j j ,niin(N, j j+csize-1) 
DO k = kk,min(N,kk+csize-l) 

out(k, j ,i) = cl(i, j ,k) 
END DO 
END DO 
END DO 
END DO 
END DO 
END DO 

Here, the indices ibeg and iend indicate the starting and stopping indices 
that define the slab for the initial domain decomposition of the data cube. 
The quantity csize refers to the cache-size, which is tunable. The outer loop 
is distributed among threads if the number of planes comprising the slab is 
greater than or equal to the number of threads, nthrd times the cache size 
of each thread. The use of this directive suggests a decomposition scheme like 
that illustrated in Fig. |3]^b). If the number of planes is less than nthrd* csize, 
then the inner loop is parallelized, which provides a domain decomposition 
scheme represented by Fig. I^c). In this way, we minimize the effect of a 
potential load imbalance. 

This example not only shows explicitly how loop-level parallelization is 
achieved, but also demonstrates one of the ways in which effective cache uti- 
lization is achieved in the local transposition of data by using a technique often 
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referred to as "cache-blocking." The three outer loops ensure that the data 
handled by the inner loops is small enough to fit in cache. Since the cache size 
is tunable, this procedure for cache-optimization does not depend on whether 
the thread cache is shared or separate. It has been recognized [9] that the 
hybrid multi-level domain decomposition scheme may be especially valuable 
when taking cache optimization into account. All other loops in the code are 
modified with similar OpenMP directives, although most do not need to im- 
plement cache-blocking and the csize dependency. As a result, the remaining 
loops are parallelized as 

!$omp parallel do if ( (iend-ibeg) . ge .nthrd) private (j,k) 

DO i = ibegjiend 
!$omp parallel do if ( (iend-ibeg) . It .nthrd) private (k) 
DO j = 1,N 
DO k = 1,N 

lOperations over arrays with indices ordered as A(k,j,i) 
END DO 
END DO 
END DO 

The reason for this is that, unlike in the case of the transpose, most of the 
other loops load long lines of contiguous data into cache directly because 
they have no mixed-index dependencies; the transpose requires special treat- 
ment because of the dependence of a given block of memory on other non- 
contiguous blocks. Note that in all cases, the loops are ordered-like the above 
code fragment-so that the largest index range keeps the cache lines full. Only 
a few loops in the code (mostly associated with computation of global quan- 
tities or spectra which require reductions) have to be parallelized using the 
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OpenMP ATOMIC directive. 

In both examples, the choice of parallelizing the outer or middle loops based 
on workload per MPI task can be replaced by a COLLAPSE clause in OpenMP 
3.0. This clause can be used to parallelize nested loops as the ones shown above 
with only one OpenMP parallel directive. Both solutions have been bench- 
marked on different platforms and we observe similar timings. As a result, 
given the fact that the COLLAPSE clause is only available in compilers that 
support the new OpenMP standard, we will use the approach described above 
in the following examples to ensure portability of the code. 

Besides the loop-level parallelism, the FFTs in each slab are also parallelized 
using the multi-threaded version of the FFTW libraries. MPI calls and I/O 
calls are only executed by the main thread in each MPI task. One of the 
additional benefits of the hybrid scheme presented here is that, by reducing 
the number of MPI processes, we reduce not only the number of MPI calls, 
but also the amount of data that must be communicated, and hence the size 
of the MPI buffers required to store data. This also allows us to use parallel 
MPI I/O in environments with tens of thousands of cores, as the number of 
MPI tasks is a fraction of the total number of cores used. We will present cases 
where these considerations become significant in Section H] where we provide 
performance results for the scheme. 

4 Scalability and performance 

A variety of tests have been performed to characterize the overhead, perfor- 
mance and scalability of the new hybrid domain-decomposition method. Tests 
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were conducted primarily on two platforms: the bluefire system at the National 
Center for Atmospheric Research (NCAR), and the kraken system at the Na- 
tional Institute for Computational Sciences (NICS). The bluefire platform is 
an IBM Power 575 system, with 128 compute nodes, each of which contains 
16 sockets with Power6 processors with 2 cores each. The compute nodes are 
interconnected with InfiniBand; each node has eight 4X InfiniBand double 
data rate (DDR) links. The kraken system is a Cray XT5 with 8256 compute 
nodes. Each compute node has two six-core AMD Opteron processors for a 
total of 99072 cores. The compute nodes are interconnected with a 3D torus 
network (SeaStar). All of the tests discussed here operate in benchmark mode, 
for which no output other than timings are produced, and all solve Eqs. [T]l2] 
for about 50 timesteps. Times are measured using the FORTRAN cpu_time 
routine, and the OpenMP routine omp_get_wtime. Timings presented below 
measure only the average time per timestep for the main time-advance loop; 
the initialization time (including the configuration of FFTW) is not included. 

In the first series of tests, we consider the overhead and performance of 
OpenMP. The first test thus considers a single MPI process, and variable 
number of threads nthd with a fixed linear resolution of = 256. The results 
are presented in Fig. HI The performance for 1 and 2 threads is comparable 
for both platforms. After this, bluefire communicates out-of-socket, and its 
scaling decreases. We expect that as the core counts increase for this platform 
{e.g. , as for the Power7 system) this problem will not be as severe. For kraken, 
there are 6 cores per socket, but we still see very good scaling to about 7 
threads. Moreover, the departures from the ideal scaling observed in kraken 
while computing in-socket seem to be associated with the hardware (e.g., with 
saturation of the memory bandwidth) and not specifically with OpenMP. This 
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Fig. 4. Timing results with a single MPI task and multiple threads on two platforms. 
Both scale roughly the same for one and 2 threads, but afterward, the bluefire jobs 
run out-of-socket resulting in poor scaling. The kraken runs scale well up to 7 — 8 
threads, even though there are only 6 cores per socket. The dotted line shows the 
timings for a single thread, while varying the number of MPI processes (hence, the 
nthd axis refers to the number of MPI tasks for this curve only). The dashed line 
represents ideal scaling, and is also used in all subsequent scaling plots. 

we conclude from tests in which OpenMP parallelization is turned off, and the 
number of MPI tasks is increased. In using from 1 to 6 MPI tasks, the same 
scaling is observed as with pure OpenMP parallelization (Fig. H]). 

In order to examine effects of OpenMP overhead on bluefire results more 
closely, we compare two runs at different resolutions, one at = 256, and one 
at = 512 for a series of thread counts. These results are given in Fig. |5l In 
each plot the symbols refer to the same nthd, and Np is varied by changing 
the number of MPI processes. The MPI tasks were bound to processors (using 
"processor binding"), and symmetric multi-threading (SMT) was disabled. 
The first observation is that, for = 256, the gains as nthd is increased (for 
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Fig. 5. Two sets of bluefire runs, for = 256 (left) and for N = 512 (right). 
The curves represented by the different symbols are runs at a constant number of 
threads, as given in the legend. Note, in particular, that the difference in run time 
between the 1- and 2-thread surveys are smaller for the runs with N = 512, than 
for the cases where = 256, which suggests that the thread overhead will manifest 
itself with smaller work load on this platform. Also note tje almost linear scalability 
up to 1000 processors. 



any fixed number of MPI tasks) are roughly the same as the ones reported in 
Fig. m for only 1 MPI task. However, for large numbers of MPI tasks, using 
nthd = 2 gives better timings than nthd = 1 using the same total number 
of processors (e.g., compare the triangle and the square at Np = 256 with 
the square at Np = 128). Increasing the number of threads further does not 
give substantial speed-ups. This is observed more clearly in the = 512 runs. 
In this case, the slope between runs with nthd = 1 and nthd = 2 is larger, 
indicating better gains as the size of the problem is increased. 

As a result, in bluefire there appears to be an effect due to the thread 
overhead that is noticeable when using 2 threads (in-socket) and the problem 
size is small: we see that the differences in run time between the 1-thread and 
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Table 1 

Efficiency of runs with N linear resolution in bluefire, taking as reference runs with 
Npq = N/2 cores and nthd = 1 threads. 



N 



Np = N, nthd = 1 Np = N, nthd = 2 



256 



0.54 



0.59 



512 



0.58 



0.65 



1024 



0.63 



0.66 



2-thread runs is smaller as resolution is increased. Then, as the threads are 
out-of-socket, extra overhead appears (although for fixed number of threads, 
very good scahng is found with increasing number of MPl tasks). This can be 
further observed considering runs with = 1024. Table [1] shows the efficiency 



where T is the time per time step, and Np^ and To are respectively the number 
of cores and times measured in a reference run; we consider Np^ = N/2 with 
nthd = 1 as the reference run. For a fixed number of processors Np = N, the 
efficiency is best if two threads are used instead of one, and as resolution is 
increased efficiency improves. If Np = 2N and four threads are used, efficiency 
also increases but is at most ^ 0.4 for = 1024. 

These results suggest that a hybrid approach may be most useful for large 
enough simulations in environments with large processor counts and when a 
large number of cores is available in the same socket. To verify this we consider 



NpT 



(3) 



e = 
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the scaling to high core counts on kraken. For these runs, we set = 1536, = 
3072, and = 4096, with nthd = 6 or 12. At these resolutions, simulations 
with 1 thread cannot be executed as there is not enough memory per core in 
kraken to allocate the arrays. Several simulations at lower resolution (A^ = 
512) were done to explore configuration parameters. This mainly involved 
NUMA options in the compiler (PGI), different binding configurations, MPI 
environment settings, and distribution of jobs among processors. We observed 
no substantial differences in the timings when changing the job distribution. 
Binding processors when nthd = 6 using the run command instead of NUMA 
options in the compiler was found to be best, although by a small margin 
(~ 5%). The implementation of MPI on kraken can also be configured to do 
fast copies in memory of the data when sending and receiving large messages. 
This gives a substantial speed up of the code (8 - 10%) but was found to require 
large amounts of memory that created problems with the largest resolutions. 
As a result, to compare on an equal footing, the runs described below were 
compiled with -02 optimization, without using fast memory copy in MPI, and 
using the run command 

aprun -n $NMPI -S 1 -d $OMP_NUM_THREADS executable 

for the nthd = 6 runs, and the run command 

aprun -n $NMPI -d $OMP_NUM_THREADS executable 

for nthd = 12. In the former case, the -S 1 option tells aprun to bind one MPI 
process per socket, and NMPI refers to the total number of MPI processes. It 
should be noted that in the runs for which the most aggressive optimization 
options can be used (e.g., if enough memory is available), improvements in 
the times of up to 20% were found. 
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Fig. 6. Scalability timings for two sets of runs on kraken. The squares and crosses 
represent the timings for a = 1536 run using 6 and 12 threads, respectively. The 
triangles, and diamonds represent the same for a run size of = 3072. Note the 
cross-over in performance at about Np = 10000 where the nthd = 12 configuration 
outperforms the nthd = 6 configuration for both problem sizes. 



The results are given in Fig. |6l We see that good speedup is achieved up to 
~ 20000 cores, but there are some interesting observations. The mean parallel 
efficiencies for these runs are 76, 83, 61, and 71%, respectively, from top to 
bottom in the legend of Fig. |6l Maximum efficiencies observed are slightly 
above 1, indicating super-scaling for some configurations. The efficiency mea- 
sured for the jobs with larger processor count is given in Table O Based on 
the results shown in Fig. |U we can expect optimal results for nthd = 6. How- 
ever, the runs with nthd = 12 scale uniformly better than the nthd = 6 
runs. Moreover, from the point of view of actual timings the nthd = 6 cases 
perform better up to a certain point at which the nthd = 12 runs outperform 
them; this point appears to occur at about the same Np for both sets of runs. 
This is particularly clear for the run with N = 3072, where the same times 
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Table 2 

Efficiency of runs witli = 3072 linear resolution in kraken, taking as reference 
runs with Np^ = Np/2 cores and same number of threads. 



Np = 9216, nthd = 6 Np = 9216, nthd = 12 Np = 18432, nthd = 12 
0.58 0.61 0.72 



are measured for Np ~ 10000 using nthd = 6 and nthd = 12. The results 
are consistent with the findings in bluefire, but the larger processor count and 
cores-per-socket in kraken allow us to obtain significant gains using the hybrid 
approach in the latter case. We conclude that if the workload per MPI process 
becomes too small, it is better to use more threads even if this puts threads 
out-of-socket. 

5 Discussion and conclusion 

We have presented a hybrid MPI-OpenMP model for a pseudo-spectral 
CFD code. Beginning with an underlying "slab" domain decomposition ade- 
quate for parallelization by MPI, we have shown how the basic method is mod- 
ified by loop-level parallelization to create a two-level parallelization scheme. 
The new level of parallelization can be thought of as modifying the under- 
lying domain decomposition scheme, and we have pointed out precisely how 
this has been done depending on the size of the problem, number of threads, 
and number of MPI tasks. 

The hybrid code has been tested primarily on two systems: the IBM Power6 
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system bluefire at NCAR, and the Cray XT5 system kraken at NICS. We have 
tested the thread overhead and performance, and found hmitations of small 
socket core counts in bluefire. We have also discovered that there is a resolution 
threshold, N, below which the thread overhead manifests itself more clearly on 
bluefire and reduces scalability. In terms of large core counts, our results show 
good scalability up to about 20000 processors on the kraken system. For large 
enough problems, we find the best scalability when the number of threads is 
12 (one MPI process per compute node). On the other hand, we find that 
the performance time is better when nthd = 6, until the workload per MPI 
process is large enough, at which point, the performance time is better for the 
case where nthd = 12. We find that, for a given MPI/OpenMP configuration, 
and a given resolution, the results are consistent from run-to-run, with little 
fluctuation in terms of scalability or run time. 

Our experimentation has suggested a number of ways in which to improve 
the compute time of kraken runs. Perhaps the most important of these involve 
configuring the MPI environment. For the large message sizes we are using, 
setting the MPI environment variable MPICH_FAST_MEMCPY yields an 8 - 
10% speedup over runs that do not use it, but it requires significantly more 
buffer memory. This increase in buffer requirements can prevent the code at 
large resolutions from fitting into memory, and must be considered carefully 
before attempting a production run. As an example, for = 4096 using 
this configuration, we could only execute the code using 24576 processors and 
6 threads, and any other distribution using the same or smaller number of 
processors and changing the number of threads failed because of insufficient 
memory. We have not attempted larger resolution runs yet, but we note that 
the memory issue addressed here will become more of a concern the larger A^ 
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becomes. 



The hybrid scheme introduced here is not the only way in which to decom- 
pose the pseudo-spectral grid. An alternative is to retain a pure MPI model 
as in [15]. In this model, the domain decomposition takes the form of "pen- 
cils" which yield a 2D (A^^) distribution among MPI processes, and OpenMP 
is not required. This technique is also found to scale well [5] to large core 
counts, although severe fluctuations in performance are observed even within 
a given processor-domain mapping. The pure MPI model does not suffer from 
effects of thread overhead (thread re-starts and synchronization) that we ob- 
serve in smaller resolution runs, nor from potential problems with compiler 
optimizations that may arise when OpenMP is used [9]. Nevertheless, the 
hybrid method described here can be applied to non-Fourier basis spectral 
methods which may be impossible to parallelize with the 2D distribution {e.g. 
, spherical harmonics). As pointed out in Section [3l our hybrid method offers 
a two-level parallelization method that may be more effective in mapping the 
domain to the hierarchical architectures that are now emerging, and better 
suited for environments with multiple cores per socket. Indeed, as noted in 
Section |H based on our tests on bluefire and kraken we expect the hybrid ap- 
proach to provide better performance results in coming years, as the number of 
cores per socket continues to increase. The hybrid scheme may also aid in the 
MPI memory problems mentioned above, in that fewer MPI processes require 
less buffer memory. We intend in the future to continue testing this method 
to higher resolution as accessibility to a larger number of processors becomes 
more readily available. Since the code described here integrates the Navier- 
Stokes or magnetohydrodynamics equations when coupling to a magnetic field, 
including rotation, the hybrid scheme we have developed will prove useful in a 
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variety of geophysical and astrophysical phenomena. And finally, we note that 
this approach works well even if the aspect ratio of the computational domain 
is not equal to unity. 
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